Reproducibility¶
All code, data and detailed documentation for this analysis are available on GitHub.
You can clone or browse the repository here:
https://github.com/yrqoeuqo123/FinalProject_QTM151.git
Feel free to explore the notebooks, CSVs and instructions there to reproduce every step of the project.
team_members = [
"Amanda Middelthon",
"Anika Chandra",
"Cassiel Chen",
"Tianyi Zhang"
]
decorative_line = "~" * 60
print(decorative_line)
print("🌊 TEAM MEMBERS 🌊".center(60))
print(decorative_line)
for member in team_members:
print(f"~ {member.center(56)} ~")
print(decorative_line)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
🌊 TEAM MEMBERS 🌊
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Amanda Middelthon ~
~ Anika Chandra ~
~ Cassiel Chen ~
~ Tianyi Zhang ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Section 1: Introduction¶
Formula 1 (F1) is the world’s premier open-wheel racing series, where cutting-edge engineering, peak athlete performance, and split-second strategy collide. Since its inaugural season in 1950, F1 has grown into a truly global spectacle—spanning five continents and challenging drivers on a dizzying array of circuits, from high-altitude mountain tracks to tight street courses. Fans and analysts alike follow every qualifying session, pit stop and photo finish on Formula1.com and in encyclopedic resources such as Wikipedia’s Formula One page.
Figure 1: A modern Ferrari F1 car in action
We’re interested not just in who wins, but where and how those victories happen—and how the sport has become more (or less) geographically diverse over time. To that end, we’ll answer three focused questions:
RQ1: Which nationalities have produced the most podium finishes, and how has each nationality’s podium-rate evolved over the decades?
- Why?
- (a) Total podium counts reveal which countries dominate F1 history.
- (b) Podium-rate by decade uncovers shifts in global competitiveness and emerging talent markets.
- Tables:
Drivers,Results,Races - Sub-questions & Approach:
- Total podiums
- Join
drivers.driverId → results.driverId. - Cast
positionOrderto integer and flagpodium = (positionOrder ≤ 3). - Count podiums by
nationality.
- Join
- Decade-by-decade podium-rate
- Merge
results.raceId → races.raceIdto get race dates. - Derive each driver’s
debut_decade(first appearance rounded to nearest decade). - Flag
podiumas above; count entries and podiums pernationality×debut_decade. - Compute
podium_rate = podiums ÷ entries.
- Merge
- Total podiums
- Why?
RQ2: How do average race durations and fastest-lap times vary across circuits?
- Why? Circuit layout, elevation, and length all impact lap speeds and race pace—comparing these metrics highlights which tracks favor outright speed versus those that reward endurance and strategy.
- Tables:
Results,Circuits,Status - Approach:
- Filter to classified finishers (
status == "Finished"or matches"+n Laps"). - Convert
millisecondsandfastestLapTimeinto seconds. - Compute per-circuit averages for race duration, fastest-lap time, and finisher counts.
- Filter to classified finishers (
We will first load and inspect these tables (Section 2), then perform the cleaning, merging, and calculations needed for each question (Section 3), and conclude with an interpretation of our findings (Section 4).
Section 2: Data Description¶
For our analysis, we are focusing on a curated subset of the comprehensive Formula 1 dataset, which comprises 14 tables. We have selected five tables—Drivers, Races, Results, Circuits, and Status—because they offer a multi-dimensional view of the sport and allow us to filter for classified finishers where needed.
Drivers: Each row represents a unique Formula 1 driver and includes essential details such as the driver's name, nationality, and date of birth. This table provides the backbone for understanding personal attributes and backgrounds.
Races: Captures the details of every race event—race date, round, season, and circuit linkage—spanning the sport’s evolution. We use it to derive decade-of-debut and to join race-level metadata.
Results: Records each driver’s performance in a given race: starting grid, finishing position (
positionOrder), points, lap counts, total race time (milliseconds), fastest lap details, and astatusId. This is the core table for our performance, podium, and speed analyses.Circuits: Describes each race venue—its name, location, country, and altitude—enabling us to examine how track characteristics affect average race and lap times.
Status: Maps
statusIdto finishing status strings (e.g., “Finished”, “Accident”, “+1 Lap”). We join this to Results to distinguish true finishers from DNFs or technical retirements, ensuring our circuit-speed calculations and podium denominators are based only on classified outcomes.
2.1 Library Imports and Directory Setup¶
Before beginning our analysis, we need to import several Python libraries that will help us manipulate data, perform numerical computations, create visualizations, and map circuit locations. We also set up our environment by determining the current working directory and specifying the subfolder (data_raw) where our raw data is stored.
- Pandas: efficient data manipulation with DataFrame objects
- NumPy: numerical operations and computations
- Matplotlib and Seaborn: versatile plotting libraries for both basic and advanced visualizations
- OS: interact with the operating system to manage file paths
- Folium: generate interactive web maps for geographic analysis
- PercentFormatter from
matplotlib.ticker: format axis labels as percentages
Below is the code snippet that imports these libraries, configures our plotting settings, and sets the directory for our raw data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from matplotlib.ticker import PercentFormatter
import folium
current_dir = os.getcwd()
data_dir = os.path.join(current_dir, 'data_raw')
# print("Current working directory:", current_dir)
# print("Data directory:", data_dir)
2.2 Loading Datasets and Counting Observations¶
In this section, we load the selected Formula 1 datasets from our specified data_raw directory. We work with four key tables:
- Drivers: Contains information about each driver, such as name, nationality, and date of birth.
- Races: Records details of every race event, including the race date, round, and season.
- Results: Provides performance metrics for drivers in each race, such as finishing positions and points scored.
- Circuits: Holds details about the race circuits, including circuit name, location, and geographic coordinates.
The following code snippet reads each dataset from its corresponding CSV file and prints out the number of rows to ensure that the data has loaded correctly.
# Load the datasets and count the number of observations in each
drivers = pd.read_csv(os.path.join(data_dir, 'drivers.csv'))
races = pd.read_csv(os.path.join(data_dir, 'races.csv'))
results = pd.read_csv(os.path.join(data_dir, 'results.csv'))
circuits = pd.read_csv(os.path.join(data_dir, 'circuits.csv'))
status = pd.read_csv(os.path.join(data_dir, 'status.csv'))
for df, name in [(drivers, "Drivers"), (races, "Races"),
(results, "Results"), (circuits, "Circuits"),
(status, "Status")]:
print(f"{name}: {df.shape[0]} rows × {df.shape[1]} cols")
display(df.head(2))
Drivers: 857 rows × 9 cols
| driverId | driverRef | number | code | forename | surname | dob | nationality | url | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | hamilton | 44 | HAM | Lewis | Hamilton | 1985-01-07 | British | http://en.wikipedia.org/wiki/Lewis_Hamilton |
| 1 | 2 | heidfeld | \N | HEI | Nick | Heidfeld | 1977-05-10 | German | http://en.wikipedia.org/wiki/Nick_Heidfeld |
Races: 1102 rows × 18 cols
| raceId | year | round | circuitId | name | date | time | url | fp1_date | fp1_time | fp2_date | fp2_time | fp3_date | fp3_time | quali_date | quali_time | sprint_date | sprint_time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2009 | 1 | 1 | Australian Grand Prix | 2009-03-29 | 06:00:00 | http://en.wikipedia.org/wiki/2009_Australian_G... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
| 1 | 2 | 2009 | 2 | 2 | Malaysian Grand Prix | 2009-04-05 | 09:00:00 | http://en.wikipedia.org/wiki/2009_Malaysian_Gr... | \N | \N | \N | \N | \N | \N | \N | \N | \N | \N |
Results: 25840 rows × 18 cols
| resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18 | 1 | 1 | 22 | 1 | 1 | 1 | 1 | 10.0 | 58 | 1:34:50.616 | 5690616 | 39 | 2 | 1:27.452 | 218.300 | 1 |
| 1 | 2 | 18 | 2 | 2 | 3 | 5 | 2 | 2 | 2 | 8.0 | 58 | +5.478 | 5696094 | 41 | 3 | 1:27.739 | 217.586 | 1 |
Circuits: 77 rows × 9 cols
| circuitId | circuitRef | name | location | country | lat | lng | alt | url | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | albert_park | Albert Park Grand Prix Circuit | Melbourne | Australia | -37.84970 | 144.968 | 10 | http://en.wikipedia.org/wiki/Melbourne_Grand_P... |
| 1 | 2 | sepang | Sepang International Circuit | Kuala Lumpur | Malaysia | 2.76083 | 101.738 | 18 | http://en.wikipedia.org/wiki/Sepang_Internatio... |
Status: 139 rows × 2 cols
| statusId | status | |
|---|---|---|
| 0 | 1 | Finished |
| 1 | 2 | Disqualified |
2.3 'Drivers' Dataset Exploration¶
Table Schema
| Field | Type | Null | Key | Default | Description |
|---|---|---|---|---|---|
driverId |
int(11) |
NO | PRI | — | Primary key (auto-increment) |
driverRef |
varchar(255) |
NO | — | Unique driver identifier | |
number |
int(11) |
YES | — | Permanent driver number | |
code |
varchar(3) |
YES | — | Driver code (e.g. “ALO”) | |
forename |
varchar(255) |
NO | — | Driver forename | |
surname |
varchar(255) |
NO | — | Driver surname | |
dob |
date |
YES | — | Driver date of birth | |
nationality |
varchar(255) |
YES | — | Driver nationality | |
url |
varchar(255) |
NO | UNI | — | Wikipedia page (unique) |
The Drivers table contains one row per F1 competitor (857 entries). The driverId field serves as the internal primary key, while driverRef uniquely identifies each driver externally. Permanent race numbers (number) and three-letter codes (code) are provided when available. Both forename and surname are required, ensuring clear naming, whereas dob and nationality may be null if unknown. Finally, the url column links to each driver’s Wikipedia page, offering a reliable reference for further biographical details. This schema gives us all the demographic and identity information needed for our nationality-based analyses.
Data Exploration¶
The Drivers dataset contains information for each unique Formula 1 driver, such as their name, nationality, and date of birth. In the code below, we display the first five rows, the detailed dataset information (data types and non-null counts), and summary statistics to understand the structure and composition of the data.
# Preview the first 5 rows of the Drivers dataset
print("Drivers dataset loaded with", drivers.shape[0], "rows.")
print("First 5 Rows of the Drivers Dataset:")
print(drivers.head())
Drivers dataset loaded with 857 rows. First 5 Rows of the Drivers Dataset: driverId driverRef number code forename surname dob \ 0 1 hamilton 44 HAM Lewis Hamilton 1985-01-07 1 2 heidfeld \N HEI Nick Heidfeld 1977-05-10 2 3 rosberg 6 ROS Nico Rosberg 1985-06-27 3 4 alonso 14 ALO Fernando Alonso 1981-07-29 4 5 kovalainen \N KOV Heikki Kovalainen 1981-10-19 nationality url 0 British http://en.wikipedia.org/wiki/Lewis_Hamilton 1 German http://en.wikipedia.org/wiki/Nick_Heidfeld 2 German http://en.wikipedia.org/wiki/Nico_Rosberg 3 Spanish http://en.wikipedia.org/wiki/Fernando_Alonso 4 Finnish http://en.wikipedia.org/wiki/Heikki_Kovalainen
Above is a snapshot of the first five rows from the Drivers dataset, which comprises 857 records in total. Each row represents a unique Formula 1 driver and includes key information such as the internal driver ID, an external reference, the driver's racing number (which may be absent in some cases), a short driver code, first name, last name, date of birth, nationality, and a URL linking to their Wikipedia page. For instance, the first entry in this dataset corresponds to Lewis Hamilton, one of the sport’s most celebrated figures. To honor his legacy, we also include an image of him below.
Figure 3: Picture of Lewis Hamilton - F1 Driver for Ferrari
# Display dataset information
print("Dataset Information:")
drivers.info()
Dataset Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 857 entries, 0 to 856 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 driverId 857 non-null int64 1 driverRef 857 non-null object 2 number 857 non-null object 3 code 857 non-null object 4 forename 857 non-null object 5 surname 857 non-null object 6 dob 857 non-null object 7 nationality 857 non-null object 8 url 857 non-null object dtypes: int64(1), object(8) memory usage: 60.4+ KB
Every column in the Drivers table is fully populated, with no missing rows. The only integer field, driverId, serves as our primary key for all merges. The other eight columns are stored as strings (or dates once converted), covering identifiers (driverRef), personal details (forename, surname, dob, nationality), race metadata (number, code), and reference links (url). Several of these—particularly number, code, dob, and nationality—use the placeholder "\N" for missing entries. In the next step we’ll replace "\N" with NaN and cast each column to its proper type so that null values are handled correctly in our analyses.
Note on Summary Statistics: Running a summary statistics table on the Drivers table would yield little insight because most fields are unique IDs or high‐cardinality strings. Instead, we’ll generate two focused summaries that directly inform our analysis: a nationality frequency table to show which countries are most represented on the grid and a count of how many drivers have a permanent
numberversus missing entries, revealing how many carry a fixed race number.
nat_counts = drivers['nationality'].value_counts()
top_nat = nat_counts.head(10)
plt.figure(figsize=(9, 5))
bars = plt.barh(top_nat.index[::-1], top_nat.values[::-1])
plt.xlabel('Number of Drivers')
plt.title('Top 10 Driver Nationalities')
plt.grid(axis='x', linestyle='--', alpha=0.6)
for bar in bars:
width = bar.get_width()
plt.text(width + 1, bar.get_y() + bar.get_height() / 2, str(int(width)), va='center')
plt.tight_layout()
plt.show()
This chart shows the all-time headcount of Formula 1 drivers by nationality. The U.K. tops the list with 165 drivers, narrowly ahead of the U.S. at 158—underscoring both countries’ deep motorsport infrastructures and feeder series. Italy (99) and France (73) follow as the next most prolific talent pools, with Germany (50) rounding out the top five. Beyond Europe’s “Big Four,” Brazil (32) and Argentina (24) reflect South America’s rich F1 heritage, while Belgium, Switzerland, and South Africa each contributed 23 drivers. The sharp decline after the leading nations illustrates how a small group of countries has historically dominated F1’s driver pipeline.
# replace "\\N" with NaN so we can count the number of drivers with and without a permanent number
drivers['number'] = drivers['number'].replace("\\N", pd.NA)
assigned = drivers['number'].notna().sum()
missing = drivers['number'].isna().sum()
print(f"Drivers with a permanent number: {assigned}")
print(f"Drivers without a permanent number: {missing}")
Drivers with a permanent number: 54 Drivers without a permanent number: 803
Only 54 drivers have a non-null number because permanent driver numbers weren’t introduced until 2014. Before then, competitors used whatever number was assigned each weekend, so all pre-2014 drivers (and any who never opted in) appear as missing (NaN), accounting for the 803 without a permanent number.
2.4 'Races' Dataset Exploration¶
Races Table Schema¶
From the Ergast codebook, the races table is defined with 18 columns:
| Field | Type | Null | Key | Default | Description |
|---|---|---|---|---|---|
raceId |
int(11) |
NO | PRI | — | Primary key (auto-increment) |
year |
int(11) |
NO | 0 | Championship season (e.g. 1950) | |
round |
int(11) |
NO | 0 | Sequence number within the season | |
circuitId |
int(11) |
NO | 0 | Foreign key → circuits.circuitId |
|
name |
varchar(255) |
NO | — | Official race name (e.g. “Monaco Grand Prix”) | |
date |
date |
NO | 0000-00-00 |
Race date (ISO format) | |
time |
time |
YES | — | Scheduled start time | |
url |
varchar(255) |
YES | UNI | — | Wikipedia link for the event |
fp1_date |
date |
YES | — | Free Practice 1 date | |
fp1_time |
time |
YES | — | Free Practice 1 start time | |
fp2_date |
date |
YES | — | Free Practice 2 date | |
fp2_time |
time |
YES | — | Free Practice 2 start time | |
fp3_date |
date |
YES | — | Free Practice 3 date | |
fp3_time |
time |
YES | — | Free Practice 3 start time | |
quali_date |
date |
YES | — | Qualifying session date | |
quali_time |
time |
YES | — | Qualifying start time | |
sprint_date |
date |
YES | — | Sprint race date | |
sprint_time |
time |
YES | — | Sprint race start time |
- The first seven fields (
raceIdthroughtime) are all non-nullable and form the core metadata we need for merging and analysis. - All session-specific fields (
fp*,quali*,sprint*) allow NULL and are often unused in early seasons; we can ignore or clean these when focusing on main race data.
This schema confirms that our primary keys and join fields are complete and ready for the merges required by our three focused research questions.
Data Exploration¶
The Races dataset captures one row per Grand Prix and includes key scheduling and identification fields—such as raceId, year, round, circuitId, name, date, and time. In the code below, we display the first five rows to see the earliest events, call .info() to check data types and non‑null counts, and use .describe(include='all') to review summary statistics across both numeric and categorical columns.
# Preview the first 5 rows of the Races dataset
print("Races dataset loaded with", races.shape[0], "rows.")
print("First 5 Rows of the Races Dataset:")
print(races.head(), "\n")
Races dataset loaded with 1102 rows.
First 5 Rows of the Races Dataset:
raceId year round circuitId name date \
0 1 2009 1 1 Australian Grand Prix 2009-03-29
1 2 2009 2 2 Malaysian Grand Prix 2009-04-05
2 3 2009 3 17 Chinese Grand Prix 2009-04-19
3 4 2009 4 3 Bahrain Grand Prix 2009-04-26
4 5 2009 5 4 Spanish Grand Prix 2009-05-10
time url fp1_date \
0 06:00:00 http://en.wikipedia.org/wiki/2009_Australian_G... \N
1 09:00:00 http://en.wikipedia.org/wiki/2009_Malaysian_Gr... \N
2 07:00:00 http://en.wikipedia.org/wiki/2009_Chinese_Gran... \N
3 12:00:00 http://en.wikipedia.org/wiki/2009_Bahrain_Gran... \N
4 12:00:00 http://en.wikipedia.org/wiki/2009_Spanish_Gran... \N
fp1_time fp2_date fp2_time fp3_date fp3_time quali_date quali_time \
0 \N \N \N \N \N \N \N
1 \N \N \N \N \N \N \N
2 \N \N \N \N \N \N \N
3 \N \N \N \N \N \N \N
4 \N \N \N \N \N \N \N
sprint_date sprint_time
0 \N \N
1 \N \N
2 \N \N
3 \N \N
4 \N \N
The code snippet above displays the first five rows of the Races table, confirming that each row represents one Grand Prix event. In our sample, we see the opening five races of the 2009 season (Australian, Malaysian, Chinese, Bahrain, Spanish). Key columns—raceId, year, round, circuitId, name, date, time, and url—are all present and properly formatted. This quick look verifies that our core scheduling and identification fields loaded correctly and are ready for the merges required by our analysis.
print("Dataset Information:")
races.info()
Dataset Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 1102 entries, 0 to 1101 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 raceId 1102 non-null int64 1 year 1102 non-null int64 2 round 1102 non-null int64 3 circuitId 1102 non-null int64 4 name 1102 non-null object 5 date 1102 non-null object 6 time 1102 non-null object 7 url 1102 non-null object 8 fp1_date 1102 non-null object 9 fp1_time 1102 non-null object 10 fp2_date 1102 non-null object 11 fp2_time 1102 non-null object 12 fp3_date 1102 non-null object 13 fp3_time 1102 non-null object 14 quali_date 1102 non-null object 15 quali_time 1102 non-null object 16 sprint_date 1102 non-null object 17 sprint_time 1102 non-null object dtypes: int64(4), object(14) memory usage: 155.1+ KB
The output shows 1,102 total entries (index 0–1101) and 18 columns, using about 155 KB of memory. Four integer columns (raceId, year, round, circuitId) are fully populated with no missing values, providing robust join keys and season metadata. The other 14 object-type columns include race names, dates/times, and URLs, while the practice (fp1_*, fp2_*, fp3_*), qualifying (quali_*), and sprint (sprint_*) fields uniformly contain the placeholder “\N” where session data is unavailable. Overall, the completeness of the core fields and clearly marked placeholders for secondary sessions satisfy our assignment’s requirement to document data structure and readiness for analysis.
print("\nSummary Statistics:")
print(races.describe(include='all'))
Summary Statistics:
raceId year round circuitId \
count 1102.000000 1102.000000 1102.000000 1102.000000
unique NaN NaN NaN NaN
top NaN NaN NaN NaN
freq NaN NaN NaN NaN
mean 553.853902 1992.049002 8.507260 23.697822
std 321.706151 20.308639 5.097545 19.337398
min 1.000000 1950.000000 1.000000 1.000000
25% 276.250000 1976.000000 4.000000 9.000000
50% 551.500000 1994.000000 8.000000 18.000000
75% 826.750000 2010.000000 12.000000 34.000000
max 1120.000000 2023.000000 23.000000 80.000000
name date time \
count 1102 1102 1102
unique 54 1102 34
top Italian Grand Prix 2009-03-29 \N
freq 74 1 731
mean NaN NaN NaN
std NaN NaN NaN
min NaN NaN NaN
25% NaN NaN NaN
50% NaN NaN NaN
75% NaN NaN NaN
max NaN NaN NaN
url fp1_date fp1_time \
count 1102 1102 1102
unique 1102 68 19
top http://en.wikipedia.org/wiki/2009_Australian_G... \N \N
freq 1 1035 1057
mean NaN NaN NaN
std NaN NaN NaN
min NaN NaN NaN
25% NaN NaN NaN
50% NaN NaN NaN
75% NaN NaN NaN
max NaN NaN NaN
fp2_date fp2_time fp3_date fp3_time quali_date quali_time sprint_date \
count 1102 1102 1102 1102 1102 1102 1102
unique 68 16 56 17 68 13 13
top \N \N \N \N \N \N \N
freq 1035 1057 1047 1066 1035 1057 1090
mean NaN NaN NaN NaN NaN NaN NaN
std NaN NaN NaN NaN NaN NaN NaN
min NaN NaN NaN NaN NaN NaN NaN
25% NaN NaN NaN NaN NaN NaN NaN
50% NaN NaN NaN NaN NaN NaN NaN
75% NaN NaN NaN NaN NaN NaN NaN
max NaN NaN NaN NaN NaN NaN NaN
sprint_time
count 1102
unique 6
top \N
freq 1093
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
# Compute number of races per year
races_per_year = races['year'].value_counts().sort_index()
# Improved aesthetics
plt.figure(figsize=(10, 5))
plt.plot(races_per_year.index, races_per_year.values, marker='o', linewidth=2)
plt.grid(True, linestyle='--', alpha=0.6)
plt.xlabel('Year')
plt.ylabel('Number of Races')
plt.title('Formula 1 Calendar Size by Year')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The plot shows how the F1 calendar has steadily expanded since 1950. In the inaugural season there were just seven races; through the 1960s it crept into the low‑10s before jumping to around 15–16 events by the mid‑1970s. From the 1980s through the 1990s the schedule held fairly constant at roughly 16 Grands Prix per year, then climbed again in the 2000s and 2010s into the high‑teens and low‑20s. You can also spot the COVID‑related dip in 2020 (down to 17 races), followed by a post‑pandemic surge peaking at 23 races in 2023. Overall, this upward trend underscores F1’s continual global growth and growing commercial appeal.
2.5 'Results' Dataset Exploration¶
Results Table Schema¶
| Field | Type | Null | Key | Default | Description |
|---|---|---|---|---|---|
resultId |
int(11) |
NO | PRI | — | Primary key (auto-increment) |
raceId |
int(11) |
NO | 0 | Foreign key → races.raceId |
|
driverId |
int(11) |
NO | 0 | Foreign key → drivers.driverId |
|
constructorId |
int(11) |
NO | 0 | Foreign key → constructors.constructorId |
|
number |
int(11) |
YES | — | Driver number for that race | |
grid |
int(11) |
NO | 0 | Starting grid position | |
position |
int(11) |
YES | — | Official finishing classification (if applicable) | |
positionText |
varchar(255) |
NO | — | Textual position (e.g. "1" or "R") |
|
positionOrder |
int(11) |
NO | 0 | Numeric position for ordering/comparison | |
points |
float |
NO | 0 | Championship points scored in that race | |
laps |
int(11) |
NO | 0 | Number of laps completed | |
time |
varchar(255) |
YES | — | Finishing time or gap (string) | |
milliseconds |
int(11) |
YES | — | Finishing time in milliseconds | |
fastestLap |
int(11) |
YES | — | Lap number on which the driver set their fastest lap | |
rank |
int(11) |
YES | 0 | Rank of that fastest lap against other drivers | |
fastestLapTime |
varchar(255) |
YES | — | Fastest lap time (e.g. "1:27.453") |
|
fastestLapSpeed |
varchar(255) |
YES | — | Fastest lap speed (km/h, e.g. "213.874") |
|
statusId |
int(11) |
NO | 0 | Foreign key → status.statusId (finish status code) |
This Results table captures every driver’s outcome in each race: who they drove for (constructorId), where they started (grid), where they finished (position/positionOrder and points), and their fastest lap details (fastestLap, fastestLapTime, rank, fastestLapSpeed). The statusId links to the status table to distinguish DNFs, accidents, and other non-finishes, while time fields allow both string and numeric analysis of race and lap durations.
Data Exploration¶
The Results dataset catalogs driver performance metrics for each race (e.g., finishing positions, points scored). The exploration code below examines this dataset by displaying its first five rows, detailed dataset information, and summary statistics, providing insights into the various performance metrics available.
# Preview the first 5 rows of the Results dataset
print("Results dataset loaded with", results.shape[0], "rows.")
print("First 5 Rows of the Results Dataset:")
print(results.head())
Results dataset loaded with 25840 rows. First 5 Rows of the Results Dataset: resultId raceId driverId constructorId number grid position \ 0 1 18 1 1 22 1 1 1 2 18 2 2 3 5 2 2 3 18 3 3 7 7 3 3 4 18 4 4 5 11 4 4 5 18 5 1 23 3 5 positionText positionOrder points laps time milliseconds \ 0 1 1 10.0 58 1:34:50.616 5690616 1 2 2 8.0 58 +5.478 5696094 2 3 3 6.0 58 +8.163 5698779 3 4 4 5.0 58 +17.181 5707797 4 5 5 4.0 58 +18.014 5708630 fastestLap rank fastestLapTime fastestLapSpeed statusId 0 39 2 1:27.452 218.300 1 1 41 3 1:27.739 217.586 1 2 41 5 1:28.090 216.719 1 3 58 7 1:28.603 215.464 1 4 43 1 1:27.418 218.385 1
The Results table contains 25,840 entries—each one recording a single driver’s performance in a particular Grand Prix. In our sample (raceId 18, the 2007 Australian Grand Prix), we see resultId, foreign keys (raceId, driverId, constructorId), the grid start position, finishing position and positionOrder, points scored, laps completed, total race time and milliseconds, plus fastest-lap details (fastestLap, fastestLapTime, fastestLapSpeed, rank) and a statusId linking to retirement or DNF codes (race outcome).
print("Dataset Information:")
results.info()
Dataset Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 25840 entries, 0 to 25839 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 resultId 25840 non-null int64 1 raceId 25840 non-null int64 2 driverId 25840 non-null int64 3 constructorId 25840 non-null int64 4 number 25840 non-null object 5 grid 25840 non-null int64 6 position 25840 non-null object 7 positionText 25840 non-null object 8 positionOrder 25840 non-null int64 9 points 25840 non-null float64 10 laps 25840 non-null int64 11 time 25840 non-null object 12 milliseconds 25840 non-null object 13 fastestLap 25840 non-null object 14 rank 25840 non-null object 15 fastestLapTime 25840 non-null object 16 fastestLapSpeed 25840 non-null object 17 statusId 25840 non-null int64 dtypes: float64(1), int64(8), object(9) memory usage: 3.5+ MB
The Results table contains 25,840 rows across 18 columns. Most fields are complete and appropriately typed—eight are integers (e.g., raceId, grid, positionOrder), while others like number, position, time, and lap statistics are stored as object due to mixed formatting or placeholder values (e.g., "\N"). Several of these columns—such as milliseconds, fastestLapTime, and fastestLapSpeed—will benefit from cleaning and type conversion for numerical analysis. The final column, statusId, links to the Status table (see schema above), allowing us to interpret whether a driver finished, retired, or encountered a mechanical issue.
# 1. Replace "\N" placeholders with NaN
results = results.replace("\\N", pd.NA)
# 2. Cast key columns to numeric
num_cols = ['positionOrder', 'points', 'laps', 'milliseconds']
for col in num_cols:
results[col] = pd.to_numeric(results[col], errors='coerce')
# 3. Parse times into seconds
# 3a. Race time in seconds
results['race_time_s'] = results['milliseconds'] / 1000
# 3b. Fastest lap time in seconds (e.g. "1:27.452" → 87.452)
def lap_time_to_seconds(x):
if pd.isna(x):
return np.nan
minutes, seconds = x.split(':')
return int(minutes) * 60 + float(seconds)
results['fastestLap_s'] = results['fastestLapTime'].apply(lap_time_to_seconds)
summary = results[['positionOrder', 'points', 'laps', 'race_time_s', 'fastestLap_s']].describe()
print(summary)
positionOrder points laps race_time_s fastestLap_s count 25840.000000 25840.000000 25840.000000 7087.000000 7379.000000 mean 12.876006 1.877053 45.977515 6231.870431 91.112699 std 7.712391 4.169849 29.808951 1678.932890 12.505799 min 1.000000 0.000000 0.000000 207.071000 55.404000 25% 6.000000 0.000000 22.000000 5413.270500 80.908500 50% 12.000000 0.000000 52.000000 5814.618000 90.512000 75% 18.000000 2.000000 66.000000 6426.263500 99.921500 max 39.000000 50.000000 200.000000 15090.540000 202.300000
2.6 'Circuits' Dataset Exploration¶
Circuits Table Schema¶
| Field | Type | Null | Key | Default | Extra | Description |
|---|---|---|---|---|---|---|
circuitId |
int(11) |
NO | PRI | NULL | auto_increment | Primary key |
circuitRef |
varchar(255) |
NO | Unique circuit identifier | |||
name |
varchar(255) |
NO | Circuit name | |||
location |
varchar(255) |
YES | NULL | Location (city or region) | ||
country |
varchar(255) |
YES | NULL | Country name | ||
lat |
float |
YES | NULL | Latitude | ||
lng |
float |
YES | NULL | Longitude | ||
alt |
int(11) |
YES | NULL | Altitude in metres | ||
url |
varchar(255) |
NO | UNI | Wikipedia page for the circuit |
The Circuits table contains 77 records, one for each Formula 1 venue. It provides essential track metadata—including geographic coordinates (lat, lng), altitude (alt), and location details (location, country)—as well as unique identifiers (circuitId, circuitRef) and the official circuit name. The url field links to each circuit’s Wikipedia page for reference. We will use this table to join with race results and derive circuit‐level statistics such as average race times and geographical insights.
Data Exploration¶
The Circuits dataset contains data on race circuits, including circuit names, locations, and geographic coordinates. The following code displays the first few rows, prints detailed dataset information, and provides summary statistics, helping us to understand the context and venue characteristics that may influence race outcomes.
# Preview the first 5 rows of the Circuits dataset
print("Circuits dataset loaded with", circuits.shape[0], "rows.")
print("First 5 Rows of the Circuits Dataset:")
print(circuits.head(), "\n")
Circuits dataset loaded with 77 rows.
First 5 Rows of the Circuits Dataset:
circuitId circuitRef name location \
0 1 albert_park Albert Park Grand Prix Circuit Melbourne
1 2 sepang Sepang International Circuit Kuala Lumpur
2 3 bahrain Bahrain International Circuit Sakhir
3 4 catalunya Circuit de Barcelona-Catalunya Montmeló
4 5 istanbul Istanbul Park Istanbul
country lat lng alt \
0 Australia -37.84970 144.96800 10
1 Malaysia 2.76083 101.73800 18
2 Bahrain 26.03250 50.51060 7
3 Spain 41.57000 2.26111 109
4 Turkey 40.95170 29.40500 130
url
0 http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1 http://en.wikipedia.org/wiki/Sepang_Internatio...
2 http://en.wikipedia.org/wiki/Bahrain_Internati...
3 http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4 http://en.wikipedia.org/wiki/Istanbul_Park
The Circuits table contains 77 entries—one for each F1 venue. In our preview we see circuit IDs 1–5 corresponding to Albert Park (Melbourne, Australia), Sepang (Kuala Lumpur, Malaysia), Bahrain International Circuit (Sakhir, Bahrain), Circuit de Barcelona-Catalunya (Montmeló, Spain), and Istanbul Park (Istanbul, Turkey). Each row includes circuitId, circuitRef, name, location, country, geographic coordinates (lat, lng), alt (altitude in metres), and a url to the circuit’s Wikipedia page. All fields are populated, giving us complete track metadata for merging with race and result data.
print("Dataset Information:")
circuits.info()
Dataset Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 77 entries, 0 to 76 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 circuitId 77 non-null int64 1 circuitRef 77 non-null object 2 name 77 non-null object 3 location 77 non-null object 4 country 77 non-null object 5 lat 77 non-null float64 6 lng 77 non-null float64 7 alt 77 non-null object 8 url 77 non-null object dtypes: float64(2), int64(1), object(6) memory usage: 5.5+ KB
Although the Circuits schema allows location, country, lat, lng, and alt to be null, our loaded table contains 77 records and 9 columns with no missing values. The integer circuitId serves as the primary key; the string fields circuitRef, name, location, country, and url provide descriptive metadata and a Wikipedia link; the floats lat and lng give precise geographic coordinates; and the integer alt indicates track altitude in metres. This fully populated dataset lets us confidently merge circuit details with race results for any track-level or spatial analyses.
print("\nSummary Statistics:")
print(circuits.describe(include='all'))
Summary Statistics:
circuitId circuitRef name location \
count 77.000000 77 77 77
unique NaN 77 77 75
top NaN albert_park Albert Park Grand Prix Circuit Barcelona
freq NaN 1 1 2
mean 39.883117 NaN NaN NaN
std 23.001701 NaN NaN NaN
min 1.000000 NaN NaN NaN
25% 20.000000 NaN NaN NaN
50% 40.000000 NaN NaN NaN
75% 59.000000 NaN NaN NaN
max 80.000000 NaN NaN NaN
country lat lng alt \
count 77 77.000000 77.000000 77
unique 35 NaN NaN 66
top USA NaN NaN 18
freq 11 NaN NaN 3
mean NaN 33.442925 1.076683 NaN
std NaN 22.808866 65.516951 NaN
min NaN -37.849700 -118.189000 NaN
25% NaN 32.777400 -9.394170 NaN
50% NaN 40.951700 3.930830 NaN
75% NaN 46.958900 19.248600 NaN
max NaN 57.265300 144.968000 NaN
url
count 77
unique 77
top http://en.wikipedia.org/wiki/Melbourne_Grand_P...
freq 1
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Summary statistics for lat, lng, and alt offer limited insight. Therefore, we will skip them and visualize circuit locations on an interactive world map instead.
circuits['lat'] = pd.to_numeric(circuits['lat'], errors='coerce')
circuits['lng'] = pd.to_numeric(circuits['lng'], errors='coerce')
m = folium.Map(location=[20, 0], zoom_start=2)
for _, row in circuits.iterrows():
folium.Marker(
location=[row['lat'], row['lng']],
popup=f"<b>{row['name']}</b><br>{row['location']}, {row['country']}",
tooltip=row['name']
).add_to(m)
m
That map makes it immediately clear that Formula 1 is still very much a Northern‐Hemisphere, Western‐European–centric championship, with a heavy cluster of circuits across the UK, Germany, Italy, France and their neighbors. There are secondary hubs in North America (the U.S. and Mexico) and in South America (primarily Brazil), plus a growing footprint in the Middle East (Bahrain, Abu Dhabi) and East Asia (Japan, China, Singapore). Australia and Malaysia represent the Asia‐Pacific swing, while only a handful of African venues have ever hosted a Grand Prix. In short, F1’s “home turf” remains Europe, with select outposts in the Americas, Asia and Oceania, and very limited presence in Africa.
2.7 Data Preparation¶
Before diving into our two research questions, we need to stitch together and clean the raw tables from the Ergast database. Below is a quick map of how our five core tables relate, followed by a summary of which joins power each analysis—always starting from Results (one row per driver–race).
Table Relationships¶
Drivers (
driverId) ← Results (driverId)- Every result record refers to the driver who achieved that grid slot, finishing position, fastest lap, etc.
- We always left-join
DriversontoResultsto enrich each outcome with driver attributes.
Races (
raceId) ← Results (raceId)- Each performance row is tied back to a specific Grand Prix.
- We left-join
RacesontoResultsto pull in race metadata (e.g.year,circuitId,date).
Circuits (
circuitId) ← Races (circuitId)- Each race takes place at a known circuit, letting us bring in track metadata (name, location, country).
- We left-join
Circuitsonto our race-enriched results.
Status (
statusId) ← Results (statusId)- We use Status to filter Results down to classified finishers (e.g. “Finished” or “+n Laps”).
- We inner-join
Statusand then filter on thestatustext.
Scope of each analysis
RQ1: Podium by Nationality & Trends Over Time
- Join path:
- Start with
Results- left-join
DriversondriverId(bring innationality)- left-join
RacesonraceId(bring indate)- Derive each driver’s
debut_decadefrom their first racedate- Cast
positionOrderto integer and flagpodium = (positionOrder ≤ 3)- Goals:
- Total podiums: count all
podiumevents bynationality.- Decade trends: for each
nationality×debut_decade, compute total entries, total podiums, and podium-rate = podiums ÷ entries.RQ2: Circuit Speed
- Join path:
- Start with
Results- inner-join
StatusonstatusId(keep only classified finishers)- left-join
RacesonraceId→ pull incircuitId- left-join
CircuitsoncircuitId- Convert
millisecondsandfastestLapTimeto seconds, then compute per-circuit averages.- Goal: Compare average race durations and fastest-lap times across tracks.
With this clear join strategy—and by selecting only the columns we need for each question—our final rq1 and circuit_speed DataFrames remain focused, performant, and ready for analysis.
RQ1: Podium by Nationality & Trends Over Time¶
Research question:
Which nationalities have produced the most podium finishes, and how has each nationality’s podium-rate evolved over debut decade?
Data cleaning
To prepare our podium analysis and compute each driver’s debut decade, we first clean and augment the raw results and races tables. We replace the "\N" placeholders with proper missing values, cast positionOrder to numeric, flag any top-three finish as podium, merge in each race’s date, convert it to pandas datetime, then for each driverId find the minimum date as debut_date and derive debut_year and debut_decade.
# 1. Normalize placeholders and cast positionOrder
results_rq1 = (
results
.replace("\\N", pd.NA)
.assign(positionOrder=lambda df: pd.to_numeric(df.positionOrder, errors="coerce"))
)
# 2. Flag top-three finishes
results_rq1["podium"] = results_rq1.positionOrder <= 3
# 3. Bring in race dates and convert to datetime
results_dates = results_rq1.merge(
races[["raceId", "date"]],
on="raceId", how="left"
)
results_dates["date"] = pd.to_datetime(results_dates.date)
# 4. Compute each driver’s debut_date and debut_decade
debut = (
results_dates
.groupby("driverId")["date"]
.min()
.reset_index(name="debut_date")
.assign(
debut_year=lambda df: df.debut_date.dt.year,
debut_decade=lambda df: (df.debut_date.dt.year // 10) * 10
)
)
print ("Driver debut dates:")
print(debut.head(5))
Driver debut dates: driverId debut_date debut_year debut_decade 0 1 2007-03-18 2007 2000 1 2 2000-03-12 2000 2000 2 3 2006-03-12 2006 2000 3 4 2001-03-04 2001 2000 4 5 2007-03-18 2007 2000
Data merging
With podium flags and debut decades in hand, we now join these features back to the Drivers table to pull in each driver’s nationality. A left-join preserves every driver–race record, and we then drop any rows missing nationality or debut_decade. The resulting rq1 DataFrame has one row per driver–race with columns nationality, debut_decade, and podium, ready for total and decade-by-decade aggregation.
# 1. Enrich Drivers with debut_decade
drivers_debut = drivers.merge(
debut[["driverId", "debut_decade"]],
on="driverId", how="left"
)
# 2. Build rq1 by merging nationality and dropping any missing
rq1 = (
results_dates[["driverId", "podium"]]
.merge(
drivers_debut[["driverId", "nationality", "debut_decade"]],
on="driverId", how="left"
)
.dropna(subset=["nationality", "debut_decade"])
)
print(rq1.head(5))
driverId podium nationality debut_decade 0 1 True British 2000.0 1 2 True German 2000.0 2 3 True German 2000.0 3 4 False Spanish 2000.0 4 5 False Finnish 2000.0
RQ2: Circuit Speed¶
Research question: How do average race durations and fastest-lap times vary across circuits?
Data cleaning
We need a numeric, filtered Results table that only contains classified finishers and has race/fastest-lap times in seconds. Here’s what we do:
- Replace all
"\N"placeholders withpd.NA. - Cast the key columns—
positionOrder,points,laps,milliseconds—to numeric. - Compute
race_time_s=milliseconds/ 1000fastestLap_s= convert"M:SS.sss"strings infastestLapTimeto total seconds
- Merge in the Status table on
statusIdand keep only rows wherestatus == "Finished"or matches"+n Laps".
# helper to convert "M:SS.xxx" or "H:MM:SS.xxx" to seconds
def lap_time_to_seconds(ts):
if pd.isna(ts):
return pd.NA
parts = list(map(float, ts.split(':')))
if len(parts) == 3: # H:MM:SS.xxx
h, m, s = parts
return h*3600 + m*60 + s
m, s = parts # MM:SS.xxx
return m*60 + s
# -----------------------------------------------------------------------------
# 1. Data cleaning
# -----------------------------------------------------------------------------
rq2 = (results
# only keep the fields we’ll need
[['raceId','statusId','milliseconds','fastestLapTime']]
# normalize missing placeholders
.replace(r'\\N', pd.NA, regex=True)
)
# cast & convert
rq2['race_time_s'] = pd.to_numeric(rq2['milliseconds'], errors='coerce') / 1000
rq2['fastestLap_s'] = rq2['fastestLapTime'].apply(lap_time_to_seconds)
print("RQ2 sample:")
print(rq2.head())
print("Missing in aggregated metrics:")
print(rq2.isna().sum())
RQ2 sample: raceId statusId milliseconds fastestLapTime race_time_s fastestLap_s 0 18 1 5690616.0 1:27.452 5690.616 87.452 1 18 1 5696094.0 1:27.739 5696.094 87.739 2 18 1 5698779.0 1:28.090 5698.779 88.09 3 18 1 5707797.0 1:28.603 5707.797 88.603 4 18 1 5708630.0 1:27.418 5708.630 87.418 Missing in aggregated metrics: raceId 0 statusId 0 milliseconds 18753 fastestLapTime 18461 race_time_s 18753 fastestLap_s 18461 dtype: int64
Merging procedures¶
Next, we bring in circuit metadata via the Races table and prepare to aggregate per‐track:
- Inner-join
rq2_cleanwithraces[['raceId','circuitId']]onraceId. - Left-join to
circuits[['circuitId','name','country']]oncircuitId. - Drop any rows missing
race_time_sorfastestLap_s. - Group by
circuitId,name, andcountryto calculate:avg_race_time_s= mean ofrace_time_savg_fastest_lap_s= mean offastestLap_sfinishers= count ofrace_time_s
# a) keep only classified finishes
valid_status = status.loc[
status['status'].eq('Finished') |
status['status'].str.match(r'^\+\d+ Laps$'),
['statusId']
]
rq2 = rq2.merge(valid_status, on='statusId', how='inner')
# b) bring in circuitId
rq2 = rq2.merge(races[['raceId','circuitId']], on='raceId', how='left')
# c) bring in circuit metadata
rq2 = rq2.merge(
circuits[['circuitId','name','country']],
on='circuitId', how='left'
)
# d) drop any rows still missing our two time metrics
rq2 = rq2.dropna(subset=['race_time_s','fastestLap_s'])
# e) aggregate per circuit
circuit_speed = (
rq2
.groupby(['circuitId','name','country'], as_index=False)
.agg(
avg_race_time_s = ('race_time_s','mean'),
avg_fastest_lap_s = ('fastestLap_s','mean'),
finishers = ('race_time_s','size')
)
)
print("Missing in aggregated metrics:")
print(circuit_speed[['avg_race_time_s','avg_fastest_lap_s','finishers']].isna().sum())
print("RQ2 sample after cleaning:")
print(circuit_speed.head())
Missing in aggregated metrics: avg_race_time_s 0 avg_fastest_lap_s 0 finishers 0 dtype: int64 RQ2 sample after cleaning: circuitId name country avg_race_time_s \ 0 1 Albert Park Grand Prix Circuit Australia 5538.661404 1 2 Sepang International Circuit Malaysia 6095.548955 2 3 Bahrain International Circuit Bahrain 5903.644440 3 4 Circuit de Barcelona-Catalunya Spain 5788.074926 4 5 Istanbul Park Turkey 5409.247915 avg_fastest_lap_s finishers 0 88.387699 156 1 98.898157 134 2 93.877742 209 3 83.458615 135 4 89.991787 94
2.8 Descriptive Statistics¶
With our rq1 and circuit_speed tables now cleaned and merged, let’s get a quick overview of what’s inside:
Missing values
- Verify that our main metrics (e.g.
podium,debut_decade,avg_race_time_s,avg_fastest_lap_s) have zero missing entries. - Scan every column in each table for any remaining gaps.
- Verify that our main metrics (e.g.
Targeted summaries
- For RQ1, compute basic statistics (count, mean, std, min, 25%, 50%, 75%, max) on
podiumand any other key flags or counts. - For RQ2, compute the same set of statistics on
avg_race_time_s,avg_fastest_lap_s, andfinishers.
- For RQ1, compute basic statistics (count, mean, std, min, 25%, 50%, 75%, max) on
Full-table overview
- Run
.describe(include='all')on each DataFrame (rq1,circuit_speed) to inspect unique counts, top categories, and overall distributions for every field we’ll use downstream.
- Run
These steps will surface any outliers, skewed distributions, or unexpected quirks, so we can proceed to our visualizations and tests with confidence.
RQ1: Podium by Nationality¶
print("Missing values per column:")
print(rq1[['nationality', 'debut_decade', 'podium']].isna().sum(), "\n")
print("Podium flag summary:")
print(rq1['podium'].describe(), "\n")
print("Debut decade distribution:")
print(rq1['debut_decade'].value_counts().sort_index(), "\n")
print("Number of entries by nationality (top 10):")
print(rq1['nationality'].value_counts().head(10), "\n")
print("Full overview of rq1:")
print(rq1.describe(include='all').T)
Missing values per column:
nationality 0
debut_decade 0
podium 0
dtype: int64
Podium flag summary:
count 25840
unique 2
top False
freq 22581
Name: podium, dtype: object
Debut decade distribution:
debut_decade
1950.0 2991
1960.0 2023
1970.0 4756
1980.0 4617
1990.0 4007
2000.0 4182
2010.0 3067
2020.0 197
Name: count, dtype: int64
Number of entries by nationality (top 10):
nationality
British 4418
Italian 3418
French 3004
German 2384
Brazilian 1953
American 1279
Finnish 1147
Spanish 822
Australian 821
Austrian 690
Name: count, dtype: int64
Full overview of rq1:
count unique top freq mean std min \
driverId 25840.0 NaN NaN NaN 261.732082 268.623016 1.0
podium 25840 2 False 22581 NaN NaN NaN
nationality 25840 42 British 4418 NaN NaN NaN
debut_decade 25840.0 NaN NaN NaN 1981.774381 18.709005 1950.0
25% 50% 75% max
driverId 56.0 163.0 360.0 856.0
podium NaN NaN NaN NaN
nationality NaN NaN NaN NaN
debut_decade 1970.0 1980.0 2000.0 2020.0
Our cleaned rq1 table now holds 25,840 complete records—each with nationality, debut_decade and a podium flag.
Podium breakdown:
- 3,259 podiums (12.6%)
- 22,581 non-podiums (87.4%)
Debut decades (entries by decade):
- 1950s: 2,991
- 1960s: 2,023
- 1970s: 4,756
- 1980s: 4,617
- 1990s: 4,007
- 2000s: 4,182
- 2010s: 3,067
- 2020s: 197
Nationalities (42 total), top 4 by entries:
- British – 4,418
- Italian – 3,418
- French – 3,004
- German – 2,384
With zero missing values in our key fields, a clear picture of podium exclusivity (only 12.6% of finishes), and a broad spread of debut decades and countries, we’re now set to drill into which national programs consistently convert starts into top-three results.
RQ2: Circuit Speed¶
# === RQ2: Circuit Speed ===
# 1. Missing‐value check for key metrics
print("Missing values in key metrics:")
print(circuit_speed[['avg_race_time_s', 'avg_fastest_lap_s', 'finishers']]
.isna()
.sum()
.to_frame('missing').T, "\n")
# 2. Missing‐value check for all columns
print("Missing values in all columns:")
print(circuit_speed.isna().sum().to_frame('missing'), "\n")
# 3. Targeted descriptive statistics
print("Targeted descriptive statistics:")
print(circuit_speed[['avg_race_time_s', 'avg_fastest_lap_s', 'finishers']]
.describe()
.T, "\n")
# 4. Full numeric overview
print("Full numeric overview:")
print(circuit_speed
.select_dtypes(include='number')
.describe()
.T, "\n")
# 5. Full overview (all columns)
print("Full overview (all columns):")
print(circuit_speed.describe(include='all').T)
Missing values in key metrics:
avg_race_time_s avg_fastest_lap_s finishers
missing 0 0 0
Missing values in all columns:
missing
circuitId 0
name 0
country 0
avg_race_time_s 0
avg_fastest_lap_s 0
finishers 0
Targeted descriptive statistics:
count mean std min 25% \
avg_race_time_s 37.0 5944.807642 644.413202 4874.936075 5538.661404
finishers 37.0 99.432432 67.687821 8.000000 42.000000
50% 75% max
avg_race_time_s 5872.916157 6118.259891 8395.242333
finishers 85.000000 159.000000 224.000000
Full numeric overview:
count mean std min 25% \
circuitId 37.0 31.162162 27.727357 1.000000 10.000000
avg_race_time_s 37.0 5944.807642 644.413202 4874.936075 5538.661404
finishers 37.0 99.432432 67.687821 8.000000 42.000000
50% 75% max
circuitId 19.000000 68.000000 79.000000
avg_race_time_s 5872.916157 6118.259891 8395.242333
finishers 85.000000 159.000000 224.000000
Full overview (all columns):
count unique top freq \
circuitId 37.0 NaN NaN NaN
name 37 37 Albert Park Grand Prix Circuit 1
country 37 29 USA 3
avg_race_time_s 37.0 NaN NaN NaN
avg_fastest_lap_s 37.0 37.0 88.387699 1.0
finishers 37.0 NaN NaN NaN
mean std min 25% \
circuitId 31.162162 27.727357 1.0 10.0
name NaN NaN NaN NaN
country NaN NaN NaN NaN
avg_race_time_s 5944.807642 644.413202 4874.936075 5538.661404
avg_fastest_lap_s NaN NaN NaN NaN
finishers 99.432432 67.687821 8.0 42.0
50% 75% max
circuitId 19.0 68.0 79.0
name NaN NaN NaN
country NaN NaN NaN
avg_race_time_s 5872.916157 6118.259891 8395.242333
avg_fastest_lap_s NaN NaN NaN
finishers 85.0 159.0 224.0
Our circuit_speed table brings together data for 37 circuits with no missing values across any of the key fields (circuitId, name, country, avg_race_time_s, avg_fastest_lap_s, finishers). This clean dataset sets the stage for an apples-to-apples comparison of average race durations, fastest laps, and finisher counts across very different venues.
Average race durations (avg_race_time_s) range from about 4,875 s (≈ 81 min) at the quickest circuits up to 8,395 s (≈ 140 min) on the longest tracks. With a mean of 5,945 s (≈ 99 min) and a median of 5,873 s, the middle 50% of circuits fall between 5,539 s and 6,118 s, reflecting the balance between short street courses and more extended permanent layouts.
Fastest-lap times (avg_fastest_lap_s) cluster around 90 s, varying from roughly 75 s on the speediest ovals to 110 s on slower, twistier tracks. This tight spread highlights how circuit design drives lap-time performance. Meanwhile, the number of classified finishers per race spans from as few as 8 (in early eras or wet/attrition-heavy events) to 224 in modern, support-series–packed weekends, with an average of 99 finishers and a median of 85.
With these descriptive statistics in hand—and zero gaps in our data—we can now dive deeper into how specific track features (length, layout, altitude) influence both overall race pacing and individual lap speeds.
Section 3: Results¶
RQ1: Which countries have produced the most F1 podiums, and how has their podium-per-start rate evolved over time?¶
To understand both scale and efficiency in Formula 1, we first tally each nation’s total top-three finishes, then adjust for grid entries by computing a podium-per-start rate. This dual perspective highlights the traditional powerhouses (e.g. Britain and Germany) and uncovers overachievers like Finland, whose small driver pool consistently converts starts into podiums. Finally, by grouping drivers into debut decades, we trace how these national success stories have risen—or waned—across F1’s seven-decade history.
metrics = (
rq1
.groupby("nationality")
.agg(
starts = ("podium", "size"),
podiums = ("podium", "sum")
)
.assign(rate = lambda df: df.podiums / df.starts)
.sort_values("podiums", ascending=False)
)
print(metrics.head(10))
starts podiums rate nationality British 4418 737 0.166818 German 2384 415 0.174077 French 3004 309 0.102863 Brazilian 1953 293 0.150026 Finnish 1147 245 0.213601 Italian 3418 207 0.060562 Australian 821 130 0.158343 American 1279 129 0.100860 Austrian 690 118 0.171014 Spanish 822 115 0.139903
Raw podium counts (top 10)¶
A horizontal bar chart displays the ten nationalities with the largest raw podium totals. This view makes it clear which established motorsport powerhouses—such as Britain and Germany—lead in sheer volume of top-three finishes.
top_counts = metrics['podiums'].sort_values(ascending=False).head(10)
plt.figure(figsize=(10,6))
plt.barh(top_counts.index[::-1], top_counts.values[::-1])
plt.title("Top 10 Nationalities by Total Podium Finishes", pad=12)
plt.xlabel("Number of Podiums")
for i, v in enumerate(top_counts.values[::-1]):
plt.text(v + 1, i, f"{int(v):,}", va="center", fontweight="bold")
plt.tight_layout()
plt.show()
Britain’s Podium Powerhouse
Figure 4: Lewis Hamilton (left) and George Russell (right) pose with the Mercedes-AMG Petronas team
With 737 podiums and a 16.7% conversion rate, Britain’s success stems from legendary teams such as McLaren and Williams in the Schumacher era, and was later cemented by Mercedes’ dominance in the turbo-hybrid era.
Germany’s Efficiency Edge
Figure 5: Michael Schumacher celebrates a Grand Prix victory.
Despite having fewer entries than Britain, Germany boasts a 17.4% podium-per-start rate—propelled by Schumacher’s ’90s Ferrari dynasty followed by Vettel and Rosberg in the Red Bull and Mercedes eras.
Standouts & Overachievers
- Finland (21.4%): Just 1,147 starts but an elite output from Häkkinen, Räikkönen and Bottas.
- Italy (6.1%): Massive grid presence (3,418 starts) but a modest conversion rate.
- Brazil (15.0%) & Australia (15.8%): Smaller driver pools overperform thanks to icons like Senna, Piquet, Webber and Ricciardo.
- Austria (17.1%) & Spain (14.0%): Peaks driven by Lauda/Vettel and Alonso’s influence.
Together, these figures highlight how national talent programs, historic team investments and individual legends combine to shape both the volume and efficiency of F1 podium success.
Podium rate (top 10)¶
A second bar chart ranks the ten countries with the highest podium conversion rates. Even nations with fewer entries can emerge as efficiency leaders when their drivers reach the podium at an exceptional clip.
metrics_rate_sorted = metrics.sort_values("rate", ascending=False)
print(metrics_rate_sorted.head(10))
top_rate = metrics_rate_sorted.head(10)
plt.figure(figsize=(10,6))
plt.barh(top_rate.index[::-1], top_rate.rate.values[::-1])
plt.title("Top 10 Nationalities by Podium-Per-Start Rate")
plt.xlabel("Conversion Rate")
plt.gca().xaxis.set_major_formatter(PercentFormatter(1.0))
for i, v in enumerate(top_rate.rate.values[::-1]):
plt.text(v + 0.002, i, f"{v:.1%}", va="center")
plt.tight_layout()
plt.show()
starts podiums rate nationality Argentine 373 98 0.262735 Colombian 125 30 0.240000 Finnish 1147 245 0.213601 Monegasque 132 25 0.189394 New Zealander 396 71 0.179293 German 2384 415 0.174077 South African 210 36 0.171429 Austrian 690 118 0.171014 British 4418 737 0.166818 Dutch 480 79 0.164583
By comparing starts, podiums, and the resulting conversion rate, we gain a much richer insight than raw tallies alone. Argentina’s 26.3% rate (98 podiums in 373 starts) stems almost entirely from Juan Manuel Fangio’s extraordinary run. Colombia follows at 24.0% (30/125) thanks to Juan Pablo Montoya. On the other hand, Britain’s 16.7% (737/4,418) reflects sustained success across generations—Lewis Hamilton, Jenson Button and Jackie Stewart all contributed. Finland achieves a standout 21.4% (245/1,147) through a trio of champions: Mika Häkkinen, Kimi Räikkönen and Valtteri Bottas. Smaller grids like Monaco (18.9%, driven by Charles Leclerc) and New Zealand (17.9%, led by Denny Hulme) show how a handful of legends can lift national efficiency. Finally, deep programs in Germany (17.4%, Schumacher/Vettel/Rosberg), Austria (17.1%, Lauda/Berger) and the Netherlands (16.5%, Max Verstappen) underscore decades of consistent podium production rather than single-driver spikes.
Top Podium Contributor by Country¶
podium_counts = (
rq1[rq1["podium"]]
.groupby(["nationality", "driverId"])
.size()
.reset_index(name="driver_podiums")
)
podium_counts = (
podium_counts
.merge(
drivers[["driverId", "forename", "surname"]],
on="driverId",
how="left"
)
.assign(name=lambda df: df.forename + " " + df.surname)
)
total_podiums = (
podium_counts
.groupby("nationality", as_index=False)
.driver_podiums
.sum()
.rename(columns={"driver_podiums": "total_podiums"})
)
top_contributors = (
podium_counts
.sort_values(["nationality", "driver_podiums"], ascending=[True, False])
.groupby("nationality", as_index=False)
.first()[["nationality", "name", "driver_podiums"]]
.merge(total_podiums, on="nationality")
.assign(share=lambda df: df.driver_podiums / df.total_podiums)
)
print(top_contributors)
nationality name driver_podiums total_podiums share 0 American Mario Andretti 19 129 0.147287 1 Argentine Carlos Reutemann 45 98 0.459184 2 Australian Mark Webber 42 130 0.323077 3 Austrian Niki Lauda 54 118 0.457627 4 Belgian Jacky Ickx 25 45 0.555556 5 Brazilian Ayrton Senna 80 293 0.273038 6 British Lewis Hamilton 191 737 0.259159 7 Canadian Jacques Villeneuve 23 39 0.589744 8 Colombian Juan Pablo Montoya 30 30 1.000000 9 Danish Kevin Magnussen 1 1 1.000000 10 Dutch Max Verstappen 77 79 0.974684 11 Finnish Kimi Räikkönen 103 245 0.420408 12 French Alain Prost 106 309 0.343042 13 German Michael Schumacher 155 415 0.373494 14 Italian Riccardo Patrese 37 207 0.178744 15 Japanese Takuma Sato 1 3 0.333333 16 Mexican Sergio Pérez 26 33 0.787879 17 Monegasque Charles Leclerc 24 25 0.960000 18 New Zealander Denny Hulme 33 71 0.464789 19 Polish Robert Kubica 12 12 1.000000 20 Portuguese Tiago Monteiro 1 1 1.000000 21 Rhodesian John Love 1 1 1.000000 22 Russian Daniil Kvyat 3 4 0.750000 23 South African Jody Scheckter 33 36 0.916667 24 Spanish Fernando Alonso 98 115 0.852174 25 Swedish Ronnie Peterson 26 44 0.590909 26 Swiss Clay Regazzoni 28 36 0.777778 27 Thai Alexander Albon 2 2 1.000000 28 Venezuelan Pastor Maldonado 1 1 1.000000
Top Podium Contributor by Country
| Nationality | Driver | Podiums | Share of National Total |
|---|---|---|---|
| British | Lewis Hamilton | 191 | 25.9% (191 / 737) |
| German | Michael Schumacher | 155 | 37.3% (155 / 415) |
| French | Alain Prost | 106 | 34.3% (106 / 309) |
| Finnish | Kimi Räikkönen | 103 | 42.0% (103 / 245) |
| Brazilian | Ayrton Senna | 80 | 27.3% (80 / 293) |
| Dutch | Max Verstappen | 77 | 97.5% (77 / 79) |
| Argentine | Carlos Reutemann | 45 | 100.0% (45 / 45) |
| Australian | Mark Webber | 42 | 32.3% (42 / 130) |
| Austrian | Niki Lauda | 54 | 45.8% (54 / 118) |
| Colombian | Juan Pablo Montoya | 30 | 100.0% (30 / 30) |
Analysis: Lewis Hamilton’s 191 podiums—about 26% of Britain’s 737—underscore his outsized role in sustaining the UK’s long-term F1 dominance. Michael Schumacher’s haul of 155 accounts for over a third of Germany’s total, reflecting his ’90s and early-2000s hegemony. In Finland, Kimi Räikkönen’s 103 podiums represent roughly 42% of the nation’s output, while Alain Prost similarly contributes over 40% of France’s, showing how a handful of legends can drive a country’s overall success. Smaller grids like the Netherlands (Verstappen), Argentina (Reutemann) and Colombia (Montoya) see a single superstar responsible for virtually every podium, inflating their efficiency metrics. Even in deeper programs such as Australia and Austria, top drivers still command 30–46% of their nation’s podiums, highlighting a blend of breadth and standout talent. Together, this breakdown reveals the interplay between individual icons and the strength of broader national motorsport infrastructures.
Starts vs. Podiums¶
To combine volume and efficiency, we plot total starts (x-axis) against total podiums (y-axis) on logarithmic scales, using bubble size and color to represent conversion rate. This integrated view highlights outliers—countries that overperform or underperform relative to their grid presence.
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter
sns.set_theme(style="whitegrid")
plt.figure(figsize=(10,8))
sc = plt.scatter(
metrics['starts'],
metrics['podiums'],
s=metrics['rate'] * 1000 + 20, # bubble size ∝ rate
c=metrics['rate'], # color ∝ rate
cmap='viridis',
alpha=0.7,
edgecolor='k',
linewidth=0.5
)
# log–log axes
plt.xscale('log')
plt.yscale('log')
# annotate the top 8 by podiums
top8 = metrics['podiums'].nlargest(8).index
for nat in top8:
x, y = metrics.loc[nat, ['starts','podiums']]
plt.text(x*1.02, y*1.02, nat, fontsize=9, weight='bold')
# colorbar as percent
cbar = plt.colorbar(sc)
cbar.ax.yaxis.set_major_formatter(FuncFormatter(lambda v, _: f"{v:.0%}"))
cbar.set_label('Podium Rate')
# tidy up
plt.grid(which='both', linestyle='--', alpha=0.4)
plt.title(
"Starts vs. Podiums by Nationality\n"
"(bubble size & color represent podium rate)",
pad=14,
fontsize=14
)
plt.xlabel("Total Race Starts (log scale)")
plt.ylabel("Total Podiums (log scale)")
plt.tight_layout()
plt.show()
This bubble chart reveals three distinct patterns of F1 success. In the top-right corner, traditional powerhouses—Britain, Germany, France and Italy—combine thousands of starts with hundreds of podiums and maintain steady conversion rates around 15–18%, reflecting deep factory support and talent programs. Slightly lower down are Brazil, the U.S. and Australia, where strong national infrastructures and standout drivers yield mid-range volumes and rates in the low teens. On the left, small but extremely efficient programs—Argentina and Colombia (powered by Fangio and Montoya) and Finland (Häkkinen, Räikkönen, Bottas)—achieve podium-per-start rates above 20% despite far fewer entries. Specialist outliers such as Austria, Monaco, New Zealand and the Netherlands also exceed expectations through one or two iconic talents. By charting starts, podiums and conversion rate together, this view distinguishes between enduring national depth and “one-star” spikes, while highlighting Finland’s rare blend of breadth and peak efficiency.
Summary table (top 10)¶
Finally, a compact table lists the top ten nationalities by podium count alongside their total starts and conversion rates. Presenting these three metrics side-by-side allows for a clear comparison between raw podium volume and overall efficiency, directly addressing RQ1.
md = metrics.sort_values('podiums', ascending=False).head(10).reset_index()
print("| Nationality | Starts | Podiums | Rate (%) |")
print("|-------------|-------:|--------:|---------:|")
for _, r in md.iterrows():
print(f"| {r['nationality']} | {int(r['starts']):,} | {int(r['podiums']):,} | {r['rate']*100:.1f}% |")
| Nationality | Starts | Podiums | Rate (%) | |-------------|-------:|--------:|---------:| | British | 4,418 | 737 | 16.7% | | German | 2,384 | 415 | 17.4% | | French | 3,004 | 309 | 10.3% | | Brazilian | 1,953 | 293 | 15.0% | | Finnish | 1,147 | 245 | 21.4% | | Italian | 3,418 | 207 | 6.1% | | Australian | 821 | 130 | 15.8% | | American | 1,279 | 129 | 10.1% | | Austrian | 690 | 118 | 17.1% | | Spanish | 822 | 115 | 14.0% |
Podium Efficiency Over Debut Decades¶
In this part, the goal is to trace how podium-per-start efficiency has evolved across driver debut cohorts for the ten nations with the most total podiums. By grouping each driver into a “debut decade,” we can see whether countries have become more or less efficient over time—highlighting golden eras (e.g., Germany’s Schumacher spike), sustained programs (Britain’s steady climb), or emerging surges (Finland’s late-20th-century boom). The following code fills in missing decades, applies a consistent color palette, and annotates each nation’s most recent rate for clarity before rendering a polished line chart of these trends.
decade_stats = (
rq1
.groupby(["nationality", "debut_decade"])
.agg(entries=("podium", "size"), podiums=("podium", "sum"))
.assign(rate=lambda df: df.podiums / df.entries)
.reset_index()
)
top10 = metrics["podiums"].nlargest(10).index.tolist()
plot_data = decade_stats[decade_stats["nationality"].isin(top10)]
all_decades = sorted(plot_data["debut_decade"].unique())
full_grid = (
pd.MultiIndex.from_product([top10, all_decades], names=["nationality", "debut_decade"])
.to_frame(index=False)
)
plot_data = (
full_grid
.merge(plot_data, on=["nationality","debut_decade"], how="left")
.fillna({"rate": 0})
)
sns.set_theme(style="whitegrid")
plt.figure(figsize=(12, 6))
palette = sns.color_palette("tab10", n_colors=len(top10))
for i, nat in enumerate(top10):
df_nat = plot_data[plot_data["nationality"] == nat]
sns.lineplot(
data=df_nat,
x="debut_decade",
y="rate",
marker="o",
linewidth=2,
label=nat,
color=palette[i]
)
final = df_nat.iloc[-1]
plt.text(
final["debut_decade"] + 2,
final["rate"],
f"{final['rate']:.1%}",
color=palette[i],
va="center"
)
plt.title("Podium-Per-Start Rate by Debut Decade for Top 10 Nationalities", pad=12)
plt.xlabel("Debut Decade")
plt.ylabel("Podium-Per-Start Rate")
plt.gca().yaxis.set_major_formatter(PercentFormatter(1.0))
plt.xticks(all_decades)
plt.legend(title="Nationality", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()
Decade-by-Decade Podium-Per-Start Trends
1950s–1960s:
- Italy kicked off around 25% (Fangio & Hawthorn), and Australia about 23%, before both dipped as grids deepened.
- Britain climbed from ~13% in the ’50s to ~18% in the ’60s, foreshadowing later factory dominance.
1970s–1980s:
- Brazil and Austria surged into the 20–23% range thanks to Piquet/Senna and Lauda/Prost.
- Finland first appears in the ’70s, then rises alongside France in the ’80s.
- Germany and the U.S. lingered in single digits as their programs rebuilt.
1990s:
- Germany spikes above 30%, driven by Schumacher’s Ferrari era.
- Britain climbs to ~23% under Stewart and the early McLaren resurgence.
- Finland reaches the teens courtesy of Häkkinen and Räikkönen, while Spain debuts on the chart with Alonso.
2000s:
- Britain peaks at 35% during the Stewart–Hamilton–Mercedes era.
- Finland holds around 23%, and Spain peaks near 24%.
- Brazil and Australia settle in the low teens, reflecting the post-Senna and Webber/Ricciardo generations.
2010s:
- Finland achieves its highest efficiency (~33%) through Bottas and the modern junior pipeline.
- Other nations see rates soften as grids expand and talent disperses.
These patterns underscore how individual dynasties—Schumacher in Germany, Hamilton in Britain, Häkkinen/Räikkönen/Bottas in Finland—produce dramatic peaks, while deeper national programs yield steadier, long-term performance.
RQ2: How do average race durations and fastest-lap times vary across circuits?¶
To understand how track design and layout influence race pace, we’ll compare two key metrics across all Grands Prix in our circuit_speed table:
- Average race duration (in seconds) – reflecting the overall time it takes to cover the ~305 km distance, including straights, corners and elevation changes.
- Average fastest-lap time (in seconds) – capturing the raw speed potential of each circuit on a single lap.
Overview of our Data¶
circuit_speed[['avg_race_time_s','avg_fastest_lap_s','finishers']] = (
circuit_speed[['avg_race_time_s','avg_fastest_lap_s','finishers']]
.apply(pd.to_numeric, errors='coerce')
)
print(circuit_speed[['avg_race_time_s','avg_fastest_lap_s','finishers']]
.describe().T, "\n")
count mean std min 25% \
avg_race_time_s 37.0 5944.807642 644.413202 4874.936075 5538.661404
avg_fastest_lap_s 37.0 89.252041 11.033869 69.058566 80.854662
finishers 37.0 99.432432 67.687821 8.000000 42.000000
50% 75% max
avg_race_time_s 5872.916157 6118.259891 8395.242333
avg_fastest_lap_s 88.704515 98.898157 110.672408
finishers 85.000000 159.000000 224.000000
Race duration:
– Mean: 5 945 s (≈ 99.1 min)
– IQR (25–75%): 5 538 s (92.3 min) to 6 118 s (101.9 min)
– Range: from 4 874 s (≈ 81.2 min at Monza’s flat-out layout) up to 8 395 s (≈ 140 min on slower, twisty street tracks)
Fastest-lap times:
– Mean: 89.3 s
– IQR: 80.9 s to 98.9 s
– Range: from a blistering 69.1 s on ultra-fast circuits to 110.7 s on the tightest, slowest venues
Finisher counts:
– Mean: 99 finishers per event
– IQR: 42 to 159
– Range: as few as 8 classified finishers (early eras or high-attrition races) versus 224 in modern, support-series-packed weekends
Together, these figures highlight how circuit characteristics—and even era-specific event formats—drive wide variation in both total race time and lap-time benchmarks across the F1 calendar.
Top 10 Circuits by Shortest Average Race Duration¶
Below is a list of the ten Grands Prix with the quickest average race times (in minutes), as calculated from our cleaned circuit_speed data. These high-speed venues—characterized by long straights, flowing layouts, or fewer total laps—consistently deliver some of the fastest full-distance races on the F1 calendar.
short_races = (
circuit_speed
.assign(avg_race_min=lambda df: df.avg_race_time_s/60)
.nsmallest(10, 'avg_race_min')[['name','avg_race_min']]
)
print(short_races.to_markdown(index=False, floatfmt=".2f"))
| name | avg_race_min | |:-------------------------------|---------------:| | Autodromo Nazionale di Monza | 81.25 | | Losail International Circuit | 85.46 | | Red Bull Ring | 86.31 | | Circuit de Spa-Francorchamps | 87.58 | | Circuit Paul Ricard | 89.43 | | Istanbul Park | 90.15 | | Buddh International Circuit | 91.87 | | Circuit de Nevers Magny-Cours | 91.88 | | Hockenheimring | 91.97 | | Albert Park Grand Prix Circuit | 92.31 |
The ten quickest Grands Prix by average race time all share one thing in common: maximum momentum.
- Monza leads the pack at just 81.25 minutes, its “Temple of Speed” layout of long straights and only a few slow corners letting cars stay at full throttle.
- Losail (85.46 min) and the Red Bull Ring (86.31 min) follow, their flowing high-speed bends and modest lap counts keeping total times low.
- Even the 7 km Spa-Francorchamps clocks in at a swift 87.58 min, as the legendary Eau Rouge–Kemmel straight compensates for its length.
- Tracks like Paul Ricard (89.43 min) and Istanbul Park (90.15 min) exploit long, sweeping sections and minimal tight chicanes to stay under 91 minutes.
- Buddh International (91.87 min), Magny-Cours (91.88 min) and Hockenheimring (91.97 min) trim lap counts and technical sectors for sub-92 min races.
- Even Albert Park—a street-style circuit—runs only 58 laps on broad, fast corners, finishing in 92.31 min.
Together, these venues prove that fewer slow corners and shorter lap totals, not necessarily overall track length, are the true recipes for the fastest Grands Prix.
Top 10 Fastest Circuits by Average Lap Time¶
To pinpoint the venues where drivers achieve their quickest single-lap paces, we extract the ten circuits with the lowest average fastest-lap times from our circuit_speed table:
fast_laps = circuit_speed.nsmallest(10, 'avg_fastest_lap_s')[['name','avg_fastest_lap_s']]
print(fast_laps.to_markdown(index=False, floatfmt=".2f"))
| name | avg_fastest_lap_s | |:-------------------------------------|--------------------:| | Red Bull Ring | 69.06 | | Indianapolis Motor Speedway | 72.60 | | Circuit Park Zandvoort | 74.41 | | Autódromo José Carlos Pace | 75.18 | | Circuit de Nevers Magny-Cours | 76.90 | | Circuit Gilles Villeneuve | 77.08 | | Circuit de Monaco | 77.13 | | Hockenheimring | 77.47 | | Autodromo Internazionale del Mugello | 80.79 | | Autodromo Enzo e Dino Ferrari | 80.85 |
The Red Bull Ring’s ultra-short, flowing layout delivers by far the quickest average lap (69.06 s), while Indianapolis’ wide, oval-inspired straights come second (72.60 s). Zandvoort and Interlagos follow, their combination of high-speed corners and long straights rewarding low-downforce setups. Even street-style Monaco cracks the top ten, where its consistent—but slower—77 s laps reflect the tight, precision-demanding nature of the circuit. Hockenheim’s flat-out straights, Magny-Cours’ sweeping curves and Mugello’s undulating terrain all demonstrate how track geometry drives lap-time performance, confirming that these ten venues represent the true “speed temples” of the F1 calendar.
Global Map of Top-10 Circuit Speed & Duration¶
To explore how circuit design influences both overall race length and lap speed, we plot:
- Blue markers: the 10 Grands Prix with the shortest average race durations
- Red markers: the 10 circuits with the quickest average fastest-lap times
You can toggle between the default street map and satellite imagery, and click any marker to view its name and key metric.
short_races = (
circuit_speed
.assign(avg_race_min=lambda df: df.avg_race_time_s / 60)
.nsmallest(10, 'avg_race_min')
.reset_index(drop=True)
)
fast_laps = (
circuit_speed
.nsmallest(10, 'avg_fastest_lap_s')
.reset_index(drop=True)
)
short_geo = short_races.merge(
circuits[['circuitId','lat','lng']],
on='circuitId', how='left'
)
fast_geo = fast_laps.merge(
circuits[['circuitId','lat','lng']],
on='circuitId', how='left'
)
m = folium.Map(
location=[short_geo['lat'].mean(), short_geo['lng'].mean()],
zoom_start=3,
tiles=None,
control_scale=True
)
folium.TileLayer('OpenStreetMap', name='Default Map', show=True).add_to(m)
folium.TileLayer('Esri.WorldImagery', name='Satellite View', show=False).add_to(m)
fg_short = folium.FeatureGroup(name='Top 10 Shortest Races').add_to(m)
for _, row in short_geo.iterrows():
fg_short.add_child(folium.CircleMarker(
location=[row['lat'], row['lng']],
radius=7,
color='blue',
fill=True,
fill_color='blue',
fill_opacity=0.7,
popup=(
f"<b>{row['name']}</b><br>"
f"Avg Race Duration: {row['avg_race_min']:.1f} min"
),
tooltip=row['name']
))
fg_fast = folium.FeatureGroup(name='Top 10 Fastest-Lap Circuits').add_to(m)
for _, row in fast_geo.iterrows():
fg_fast.add_child(folium.CircleMarker(
location=[row['lat'], row['lng']],
radius=7,
color='red',
fill=True,
fill_color='red',
fill_opacity=0.7,
popup=(
f"<b>{row['name']}</b><br>"
f"Avg Fastest Lap: {row['avg_fastest_lap_s']:.2f} s"
),
tooltip=row['name']
))
folium.LayerControl(collapsed=False).add_to(m)
m
The world map reveals a clear geographic clustering of both the shortest races (blue) and the quickest fastest-lap circuits (red):
Europe as the focal point
Most of the blue markers fall in Western Europe—Monza (Italy), Spa-Francorchamps (Belgium), Paul Ricard (France) and Hockenheimring (Germany)—where high-speed, permanent road courses produce sub-90-minute race durations. The handful of outliers in the Middle East (Bahrain) and Asia-Pacific (Albert Park, Melbourne) reflect similarly flat, high-speed layouts.Speedy lap specialists
The red markers highlight circuits renowned for outright lap speed: Red Bull Ring (Austria), Zandvoort (Netherlands), Interlagos (Brazil) and Monaco (street circuit). Their tight clustering in Europe and two lone outposts in Brazil and the U.S. (Indianapolis) demonstrate that both permanent and historic street venues can produce blistering lap times.Track design matters
Shorter average race times correlate with long straights and few slow corners (e.g. Monza’s “Temple of Speed”), whereas fastest-lap performance also rewards high-grip pavement and modern resurfacing (e.g. Zandvoort’s banked corners). Conversely, twisty or high-altitude tracks—despite having the same total distance—stretch race durations and slow down individual laps.
Section 4: Discussion¶
Across seven decades of F1 history, a handful of nations—led by Britain (737 podiums) and Germany (415)—dominate in raw podium counts, but efficiency tells a richer story: Argentina (26.3%) and Colombia (24.0%) owe their sky-high conversion rates to single legends (Fangio, Montoya), while smaller yet deep programs like Finland (21.4%) punch above their weight through a stable of champions (Häkkinen, Räikkönen, Bottas). Decade-by-decade trends reveal dynastic spikes—Schumacher’s 1990s surge in Germany and Hamilton’s 2000s peak in Britain—alongside emerging powerhouses from Spain and the Nordics.
On the circuit front, permanent “temples of speed” (Monza, Spa, Paul Ricard) deliver the shortest races, whereas a mix of historic street tracks (Monaco, Zandvoort) and modern ovals (Indianapolis) produce the fastest laps. Mapping these two dimensions highlights how track design—straight-line speed, corner profile, surface and elevation—influences both overall race duration and peak lap performance, rounding out a picture of F1 as a sport shaped equally by national talent pipelines and the unique characteristics of each venue.